{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear Regression\n",
    "1. Given a dataset $D=\\{(x_1,y_1),(x_2,y_2),...,(x_n,y_n)\\}$, where $x_i=(x_{i1};x_{i2};...;x_{id})$, a linear regression model predicts\n",
    "$$\\hat{y}=\\theta_0+\\theta_1x_1+\\theta_2x_2+\\cdots+\\theta_nx_n$$\n",
    "where $\\theta_0$ is the bias term and $\\theta_1,\\theta_2,\\cdots,\\theta_n$ are the feature weights, chosen so that the prediction $\\hat{y}$ is as close as possible to the true value $y$.\n",
    "\n",
    "    - In vectorized form: $$\\hat{y}=\\theta^T\\cdot{X}$$\n",
    "\n",
    "2. The MSE loss function of the linear regression model:\n",
    "$$MSE(X,h_{\\theta})=\\frac{1}{m}\\sum_{i=1}^{m}(\\theta^T\\cdot{X^{(i)}}-y^{(i)})^2$$\n",
    "3. Minimizing the loss has the closed-form solution below (the normal equation), with solution time complexity between $O(n^{2.4})$ and $O(n^3)$ in the number of features:\n",
    "$$\\hat{\\theta}=(X^{T}\\cdot{X})^{-1}\\cdot{X}^{T}\\cdot{y}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "X = 2 * np.random.rand(100, 1)\n",
    "y = 4 + 3 * X + np.random.randn(100, 1)\n",
    "plt.figure()\n",
    "plt.scatter(X, y)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[3.99641371],\n",
       "       [3.04206717]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Closed-form solution (normal equation)\n",
    "x_b = np.c_[np.ones((100, 1)), X]\n",
    "theta_best = np.linalg.inv(x_b.T.dot(x_b)).dot(x_b.T).dot(y)\n",
    "\n",
    "theta_best"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 3.99641371],\n",
       "       [10.08054806]])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_new = np.array([[0], [2]])\n",
    "x_new_b = np.c_[np.ones((2, 1)), x_new]\n",
    "y_predict = x_new_b.dot(theta_best)\n",
    "y_predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXgAAAD2CAYAAADcUJy6AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAgAElEQVR4nO3dfXxU1b3v8c9KQgBRASM+QsQHigqID1EYERgJWL3XHlvRS09VpL0aa20rp7Utantsi4Vz+zqn11bb0xNbe21v2/uy9aG3vUU9BgZQEyFYRMDnB2wEFBFFVJKQrPvHmmFCMklm9uw9s2fn+369eCXZmcz+sSHfWfNba+9trLWIiEj0lBW7ABERCYYCXkQkohTwIiIRpYAXEYkoBbyISERVBL2Dww8/3I4dOzbo3YiIRMq6devesdaOyuc5Ag/4sWPH0tzcHPRuREQixRizJd/nUItGRCSiFPAiIhGlgBcRiajAe/AiUpra29tpaWlh7969xS4l0oYMGcLo0aMZNGiQ78+tgBeRjFpaWjjkkEMYO3YsxphilxNJ1lp27txJS0sLxx9/vO/PrxaNiGS0d+9eqqqqFO4BMsZQVVUV2LskBbyI9ErhHrwgj7ECXkQkohTwIhJK3/3udznllFOYMWMGtbW1bN261dNzJBKJ/V8vXLiw359Zv34969ev77E9m58NGwW8iITWrbfeyqpVq/j85z/PnXfemffz3XHHHf0+preAz+ZnwyarVTTGmCOBP1prp3fZNhH4n9baOUEVJyIhsXAhZAi9vJx+OmQZmrt27WLo0KHE43HOPvtsNmzYwCOPPMJHH33E/Pnzefvtt5k0aRI//elP2bVrF5dffjkdHR1Ya4nH4/ufJx6P7x/R7927lwULFtDS0sKIESO47777WLx4MQ8++CAAv/nNb2hoaMj4s62trSxYsICtW7cyevRofvWrX7FkyRLa29tZvXo1u3fv5uGHH2b48OFcfvnl7N69m6qqKv7whz9QUVG4xYv9juCNMSOBe4FhXbYZ4EeA/ws3RUSSfvCDHzBjxgyampq48cYbaWpqIhaL8cgjjwBQX1/PxIkTWbVqFdu2bWPDhg3U19dz8cUXs2LFij7XltfX1zN58mQef/xx5s6dy8aNG1m6dCmLFi1i0aJFB4R7d3fffTcTJ05k5cqVjBs3jnvuuQeAl19+mVWrVnHppZeyfPlyNm/eTFlZ2f53IXv27PH3APUjm5eSDmAe8Kcu2z4PrAA+GURRIhIyRWpP3HrrrVx55ZX7v544cSKXXnrp/q9feOEFnnzySRKJBO+99x5vvvkmr732GvPmzQOgpqam1+d+/vnnmTt3LgALFizIqa7Nmzfvr2Pq1KksW7aMkSNHMn/+fACqq6tpa2vjzDPPZOLEiVxwwQWMGzeOCy+8MKf95KvfEby1dre19v3U18aYKuBK4F97+xljTJ0xptkY07xjxw5/KhWRAe/ggw8+4Ovx48ezcOFCEokEt99+O9XV1VRXV7Np0yaAjL30lJNPPpm1a9cCsGTJEn7xi18AMHToUD766CPAnYiUyYQJE2hqagKgqamJCRMmADBs2LADHvfMM88wbdo0Hn30UXbt2sXq1atz/Svnxcsk678AN1tr23t7gLW23lpbY62tGTUqr8sZi4j06tprr2XZsmXMmDGDn//854wZM4a6ujruv/9+4vE4u3fv7vNnn376aeLxOE8//TRXXXUVAHPmzOGBBx5g2rRpvQbyNddcw6ZNm5gxYwYvvfRSr+8Axo4dy09+8hPOPfdctm/f3uc7iiCY3l6hejzQmIS1Nm6MeRFIrVc6HbjLWvvt3n6upqbG6nrwIqXnueee45RTTil2GQNCpmNtjFlnrc3rFSHn6Vxr7Se6FJDoK9xFRKR4sm7RWGvj2WwTkejI9h2+eBfkMdaJTiKS0ZAhQ9i5c6dCPkCpq0kOGTIkkOfX
5YJFJKPRo0fT0tKCVsIFK3U9+CAo4EUko0GDBgVyjXIpHLVoREQiSgEvIhJRCngRkYhSwIuIRJQCXkQkohTwIiIRpYAXEYkoBbyISEQp4EVEIkoBLyISUQp4EZGIUsCLiESUAl5EJKIU8CIiEaWAFxGJKAW8iEhEKeBFRCJKAS8iElFZBbwx5khjzOrk59XGmIQxZrkxpt4YY4ItUUREvOg34I0xI4F7gWHJTdcB11trZwFjgEnBlSciIl5lM4LvAOYBuwGstbdaa59Lfq8KeCeg2kREJA/9Bry1dre19v3u240x84BN1tqtGb5XZ4xpNsY079ixw6dSRUQkF54mWY0xJwA3AQszfd9aW2+trbHW1owaNSqf+kRExKOcAz7Zk/898IVMI3sREQkHLyP4RUA1cGdyNc1Mn2sSEREfVGT7QGttPPnxW8C3gipIRET8oROdREQiSgEvIhJRCngRkYhSwIuIRJQCXkQkohTwIiIRpYAXEclRYyMsXeo+hlnW6+BFRMSFem0ttLVBZSU0NEAsVuyqMtMIXkQkB4mEC/eODvcxkSh2Rb1TwIuI5CAedyP38nL3MR4vdkW9U4tGRCQHsZhryyQSLtzD2p4BBbyISM5isXAHe4paNCIiEaWAFxHJQ5iXTKpFIyIDXmOjt556aslka6ubdL3rLqirC6rK3CngRWRAy2ddeyLhwr2z0/254QaYNCk8/Xm1aERkQMtnXXs87kbuKZ2d4VoXr4AXkQEtn3XtsZhry1RUQFkZDB4crnXxatGIyICW77r2ujrXlgnjunhjrQ10BzU1Nba5uTnQfYhI+HmdyByojDHrrLU1+TyHRvAiErhSukBXlKgHLyKBK6ULdIViXftHH/nyNFmN4I0xRwJ/tNZON8YMAh4ADgN+aa29x5dKRCSyUhOZqRF8mCYiuyraO419+6C52e3wscfgySd9edp+R/DGmJHAvcCw5KavAOustdOAy4wxh/hSiYhEVmoic/HicLdnCvZOw1rYvBl+8hO45BKoqqIx9k8s/fYeGt+shq98xZfdZDOC7wDmAX9Kfh0HFiU/XwXUACu6/oAxpg6oA6iurvajThEpcWG7QFemSd9A32n8/e/u1S31Z9s2t/3EE2k8/xZql32dto5yKlsMDXOBf/u3vHfZb8Bba3cDGGNSm4YBbyY/fxc4MsPP1AP14FbR5F2liAwYQa62ST13VRUsXNizFePrpYB37YIVK9JtlxdfdNtHjXJ9oNmz3cexY0kshba/+P/Owcsqmj3AUOB94ODk1yIieQuyB971ucvKXJh2dqYDNbUfz+80Pv4YnnjChXlDA6xb51oxw4bBzJnwxS+6AiZOdAV0EdQ7By8Bvw44D/gjMBlo8qcUEfFLqa45z9QD96v+rs9trctYY3IL1AOO69n7XIinWi5PPOEuTFNR4Yq+7TYX6Oec43bSh6BuIuIl4O8F/mqMmQ6cCjzlTyki4odSXnMeZA+8+3PfcQfs3Jl9oNbXW758A3R0WAabdhqGXkzsw8fcNydPdlcamz0bpk+Hgw/Oub4g5iiyDnhrbTz5cYsxZg5uFP/P1toOf0sSkXwEOQoOWrYjWS/vUDyNkltaoKGBxv+zhRsevpl9VABltNpyEqdeT+zr18D558MRR2RXRIF5OpPVWrsVuM/nWkTEB6Wy5rw3/Y1k83mHknpcahKzx8/t2uW+mZoYfeEF9/iDFtNJGWAAS/mgCuI/vpRGIPHL8LbCdKkCkYgppZtCexmJ5/MOpceLw19biXU8ng70devczOuwYTBjBlx7LcyeTXzPJAbPKaO1FcrKDHfd5Z7Pr1ZYUHMmCniRCArbmvNMvI7E83mHkljeSVuroaPT0PbxPhJzbie273Y3MTplCnznO66oKVMOmBiN0fNFc+lSf1phQc6ZKOBFpCi8jsS7vkOpquqj3QJuucwLL+xf6RL/zz1Udj5EG4OoNPuIzz0crvqLG60f0vdJ+d1fNP1qhQU5Z6KA
F5GiyCcgUwGYceS7dWu65dLQAG8mz8s87jhi82bTUP0EidYY8YsPJha70XP9frXCgpwz0fXgRaRocu09d318IuE6Kh0dUF7WyaeO28BHb+9h7of3Uscv3PB+1qz0GaMnnOAWvodQpuPgx/XgFfAiJapUT2byKt2rtlRWdHLHrP/Lwkcuoq2zHEMn+0j3zP/j1jeo+/6YHmeMlhLd8ENkAOka6FC6JzPlrKMD/vY3Et/7kLaPz6ODcto6Otn5cDMNE5eROPxyHto+hTXPVeKWMcL9a4+jrnSz3TcKeJES0H2lxdVXh+NkpkDeRVgLL72U7qGvWAG7dhFnKpVmOW1UUjmojPj/u4XY7GHEgKp6WHNd+inmzvWplhKngBcpAd1XWkDxT2bydXnftm3pa7o89pg7gxSguho+8xmorSU2axYNrw3t8oIybP+P19W5j/ff78J90iS4/nq3bf78CL+76YcCXqQEdF9pMX++++N19OzHyDuv5X3vvw8rV6YDffNmt/2ww9zEaOpyuieeeMDEaOyo3vdRV+f+NDa6qwe0trrt99xTWpdr8JMCXqQE9LYkr5gn1uS0vK+11e041XZZu9a9Mgwd6i7OtWCBK+r00/OeGE298KS0tyvgRSTk/Do71a8Ta/pcB97RAevXp9suq1e766WXl7vL5958swv0WAwGD87/L9VF6oUnNYIfNKj0rsfjFwW8yADj54k1+190rIWXXk63XFasgHffdQ+aMGH/NV2YORMOPdSHv0XfNa1YAb/+tfu6VHrwQUxYK+BFBhjfLka2fTssX55uu7zxhts+Zoy7kXRtreunH320p6fPJ/C8vNsp5nkFmdpmflDAiwxAnto9u3enJ0YbGmDjRhqZSmLIRcSnXkFs0Rg3Sj/ppP0To42NkPhfuYdmoW9aUuybpGRqm/lBAS8imbW2QlNTuu2yZo1LoCFDYPp0Gqd/k9p7rqCt3VD5lKFhCcTGpX88n9As9E1Lin2TlDDdk1VEoqizE555Jt1yWbXKTYyWlcHZZ8OiRemJ0SFDSCyFtn29h2I+odl1orSszF1WJkjFvklKmO7JKiJRYC288kq65bJ8ubtJKcCpp8I117hAnzkTRozo8eP9hWK+V4u84w53m9OODli40J28FNSoupA3Semt11/Ue7KKhM1Au9hWb3I6Dm+9deDE6JYtbvvo0XDxxa6HPmsWHHNMv/vNJhSvvtp99LKSZedO9xrU2VmYtkkhbpJS6F5/zgFvjBkJ/BY4Alhnrb2unx8R8V2xJ8VyEeQLUb/H4YMPXKslFejPPuu2jxjhTvf85jfdE3ziE54updtbKHav64wzcj8GxW6bBKHQvX4vI/irgN9aa39rjPmdMabGWqvrAUtBFXtSLFtBvxD1OA4N+4i1P5luuzz1FOzb5yZGzzsPPvc5N0o/4wx30lFAutbV2gpf+pIbiVdWujXqud65KSrv0gr9ouUl4HcCE40xI4AxwN/9LUmkf6Uyugv6hSg+o5PKCmjrtFTaduK3XwStCTczWVMD3/iGC/Rzz3UhXyBd/33A/f3Bhf0Pf+hOZs0mtEvh3rK5KPSLVs43/DDGHAcsBZ4HRgM3WGvbuz2mDqgDqK6uPmtLqs8n4qN87gZUqNBIjeBbW92A+a670lc+9OzVV9NLF5cvp/Gdk0gQJ37c68Q+dbjbYTyecWK0kFLHe80aeOih9PbUG4ewt9aKrSh3dDLG3AMstNbuNsZ8Ddhjra3v7fG6o5OEgddWiR8vCvX1bjVIZ6e77ErOofb2225iNBXqr7/uth9zTPp2dLW1cOyx3goMWGOjO37t7e6NRWritLwcFi92l6WRnop1R6eRwCRjTBMwBXgsnwIkfKK4OsVLq8Sv/nnOq0H27HETo6lA37DBbR8+3E2Mfv3rLtjHjw/tPUa7isXc3zmRcOvZFy4Mf2stKrwE/FLgV8BxQCPwe18rkqIK8+qUfF54vPTs/eqf97vv9nY3GZpa6dLU5CZG
Bw+GadNgyRL3j3LmmVBRuJXN+Rzv7j/btZc+aVL0BhBhlfP/FmvtGmBCALVICHgNtaBH/fm+8HiZ3PJrIrfHvqd0wjPPple6rFwJH37oRuM1NXDTTemJ0aFDve00T/kc7/5+NmoTp2GmE53kAF5CrRCj/nxH015egPxc8RA76jVioxrgx25ilB073DfGj3dnA82e7XYycqT3nfgon+NdKktYBwIFvBzAS6gV4hc6n9F0Pi9AnkebO3a4Bd+ptsurr7rtRx8Nn/xkenJ09Oj9NSZ+Hp62RT7Hu1SWsA4ECnjpIddQC+oXuvuo2+touiAjyj173F2LUhOjzzzjth96qJsYXbjQhfrJJ/eYGA3jvEc+xzuKJyiVKgW85C2IX+jeQi/we4dmq73dLfBOBXpTk9tWWekmRm+/3QX6WWf1OzEa1pZGPr1y9dnDQQEvvvDrFzo1an/jDf9Cz5cXIGth48Z0y2XlSjdqN8atbvna19wr0rRpcNBBOT21WhoSFAW8hEbXUXt5eXrg60foeXoB2rIlHegNDe6EI4Bx4+Cqq1yx558Phx2Wd21qaUgQFPASGl1bFeDu01xdXcDQe+cdNzGaaru88orbftRRMGdOemJ0zBjfd62WhgRBAS+h0b1V4eUa4jn58EN4/PH0KH39eteKOeQQV8xXv+oC/dRTAz9jNIpnD0vxKeAlNLy2KrIOx/Z2WLs2PUJvbExPjJ57Lnz/+26UXlNT8DNGw7aKRqJBAS+hkmuros9wtBY2bUoH+sqV7gYYxrjroaeWLp53Xs4To7nq60UorKtopPQp4KWk9QjHh94j9vyD+y+ly/bt7oEnnQRXXJGeGM3iLs5+tU36G6FrFY0ERQEvfSpEbzivi4id8T6V5cNo6zRUdrYR/+FFQBMceWT6Mrq1tXDccTnX5FfbpL8RulbRSFAU8CWqUMEbdG8453189JGbGE22XWJ/+xsNdgqJyk8SP3sPscs/C7V3w4QJeU2M+tk2yWaErlU0EgQFfAkq1KRcIXrD/e5j3z5obk6vdHnySffAQYPcA7/3PWK1tcTOPttt80lVlXt9KCvLv22iEboUiwI+R2FYzlaoSblC9IZ77GOmhc3PpQM9kYDdu92DzzjDLV1MTYwOG+Z/Qbh/44UL03cduuOO/I+vRuhSDAr4HIRlOVuhJuVyHXl6viTv794i8b9biH/wZ2KX1cO2be6bJ54In/2sC/Tzz4fDD/f+l8lB6gW0s9ON4nfuLMhuRXyngM9BWJazFSJ4u+4r2xt+ZP3i9+67rqDkKD324ovEAI44AmbNSp8xOnZsbsX6RKtaJCoU8DnI95rkfrZ2AgnePPT54vfxxwdMjPL0026N+sEHw8yZ8MUvulCfODEU9xhVz1yiQgGfg3zOtCxWa6c4/XpLfNRmWPInF+hPPgmtre7s0FgMbrvNBfo55/g6Meon9cwlChTwOfLyi1/M1k5B2g3WEhvxPA1f3Eji4b3Et9xL7NoG973Jk+HLX3avcNOnu1G7iBSEAr4A8g3ZfHvogbQbWlrSl9FtaICtW4kBseOPhytmQ+21rp8+apRPO5QwrOCS0uI54I0xPwOWWWv/7GM9kZRPyPrR3vGl3bBr1wETo7zwgts+atSBE6PHH5/njiSTsKzgktLiKeCNMdOBoxTu2fMaskG1d/odDX78seudpwJ93Tq3bnDYMJgxA+rqXOJMmuTOBpJAhWUFl5SWnAPeGDMIuBv4qzHmEmvtn/wvS1KC6KFnHA2e0+FWt6QC/fHH0xOjU6fCd76TnhitrMy/CMmJlm6KF15G8POBzcAPga8YY6qttXd2fYAxpg6oA6iurs67yIGsv/aOl76sGw1aOjoMba2dJK75LbGtX4X33nMPOO00+NKXXKBPn+5ugOET9ZG90dJN8cJYa3P7AWPuAv5irX3YGHMK8ANr7aW9Pb6mpsY2NzfnWWZasQOi2PvvXktOfdmtW6Ghgcbfv0btw9+gzVZQSTsNR11J
7OIq92SzZrkTjsJQr8gAZoxZZ62tyec5vIzgXwZOSH5eA2zJp4BcFDsgir3/7vrty773ntuYOsHo+ecBiFVV0XB+G4lD/4H4FccSu+yBQOtMvSi+8Yb6yCKF5CXgfwncY4z5LDAIuMzfknpX7ImmRMK1pTs73cdiB1SPvmysFZY/ke6jNze7Yg86yE2MXnONe4U67TRiZWUUovSuL4oVFe7iXaA+skgh5Bzw1toPgMsDqKVfxZ5oqqpyeQnuYxY3BQpU7JwOGu56kcR9bxHf+QCxi+6GvXtdik6ZAt/+tkvXqVOLNjHa9UUZ4Nprobo6HC0uv4SpbSfSVUmd6BT0RFN9Pdx/P8yd61YBdrdzp1sR2NnpPhb8KoPWwksvpUfoK1YQ27XLjcQnTXLXdKmtdaP1Qw8tcHGZdX9Rnj8/WiEYtradSFclFfAQ3DVC6uvhuuvc548+6j52D/l4HAYP9u8dRFYjv23b0j30hgZ3Bim4YfBnPuNWusya5W5RF0JhX/2R7+i72G1Dkb6UXMAH5f77e37dPeD9DKteR37vvw8rV6YDffNm9wOHHXbgPUZPPDEUV17MRlgv3OXH6LvYbUORvijgk+bOTY/cU19n4ldYHTDya+0kcctjxD7+Z1i71vWAhg51rZYFC9woffJknTHqMz9G32F/hyIDWygDvlCTVl33kxqt99WD90VHB6xfT7zlOSrtZbRRTmVnO/FV34cpZXDLLS7Qp051/aCQisLEol+j77C+QxHJ+USnXOV6olOhJq0KNjlmLbz88gETo7z7rqvh+M+ROPofiX96BLHrTgvNxGh/+jp2pRb8pVavDBzFOtEpUIWatAp0P9u3py+j+9hj8Pe/u+1jxsAll9BYPY9EW4z4pw7l5hIMld6OXSmuKNHoW6IsdAFfqEkrX/eze/eBE6ObNrnthx3mbhZ9yy0u+U46icYmkw7BO0ojBLvr7dh1D/5f/1qjY5FiCl3AF2rSKq/9tLZCU1M60Nescak2ZIi7ONf8+S7QTz89fepmUhSW1fV27LoGf0UF3HOP+3uWymheJGpC14MPpc5OWL8+3XZZtcpdL72szF0+N7V0MRZzId+HUmxj5KLrdWfuvtsFfHk5LF4MN99c7OpESkcke/ChYC288kq6h75iRfq01VNPddd0mT0bZs6E4cNzeuq+3jlEYcIv1dNubIR779X6cJFiUsCnvPXWgfcY3ZK8SObo0fCpT6UvpXvMMXnvKtPEXtRG9lofLlJ8AzfgP/jATYymRukbN7rtI0a4IP/mN90ofdy4gpwxGoXefHdaoSJSXJEO+ANaHme1uYnRVKCvWQP79rme+XnnwZVXuiH0GWf0mBgtBJ3yLiJ+i2zANz7RmW55mDYaBl1ErDXhJkZratwIvbYWzj2334nRQlBLQ0T8VrSAD2RC8dVX9y9dTPxlEm2ti+iggjZbQeKMfyL2rRvdDkeM8GmH/lJLQ0T8VJSA921C8e23Yfny9Hr011932489lvjMCVQ2QFuHpbKygviP/oGC3MKoD1FYJSMipaMoAe95QnHPHrcGPRXoGza47cOHuzNGb7rJvXKMH0/MGBpCFKhRWyUjIuFXlIDPekKxrQ2eeiq9dLGpyU2MDh7sJkaXLHGpedZZGSdGC9ny6G90HsVVMiISbkUJ+F4nFDs74dln0ytdVq2CDz90E6NnneVG6LNnu4nRoUMDrzPblko2o3OtkhGRQivaJOv+0fVrr8HdyZbL8uWwY4d7wPjx7mYXtbUuDUeOLGh9ubRUshmda5WMiBSa54A3xhwJPGytPSOnH9yxwwV5apT+2mtu+zHHwIUXpq/rMnq019J8kUtLJdvRuVbJiEgh5TOC/1eg/z5JZycsW5YO9GeecduHD6fxtOtITPw08X88mthnjwvVPUZzaalodC4iYeTpapLGmFnAfwNOttbG+3psTVmZbbbWpeS0aa6HXltLY9tZ1H6yItSrSrSsUUSKpShXkzTGVALfAT4D
PNTLY+qAOoBTDj0U/vhHF+4HHbT/MYmlPVsgEK5AVUtFREqZlxbNIuBn1tr3TC8tFWttPVAP7nrwzJnT4zHdWyBVVaW9TlyjfREJGy8BPxuYZYy5ATjdGPMLa+01uT5J9751GNeJ+7lMUkSk0HIOeGvtjNTnxpiEl3BP6d4CCdM6cb+XSYqIFFpe6+D7m2DN1dVXu4/z5xc/IINYJikiUkihuFxw99Hy/PnensPPHriWSYpIqQtFwOfb4giiB55raGvFjYiETSgCPt8WR1A9cIW2iJSyUAR8vi0O9cBFRHoKRcBDfqNl9cBFRHoKTcDnS+0UEZEDlRW7ABERCYYCXkQkohTwIiIRpYAXEYkoBbyISEQp4EVEIkoBLyISUQp4EZGIUsCLiESUAl5EJKIU8CIiEaWAFxGJKAW8iEhEhTrgGxth6VL3UUREchPaywUHcRs+EZGBJOcRvDFmuDFmmTHmUWPMg8aYyiAKy3QbPhERyZ6XFs0VwI+stRcA24EL/S3JSd2Gr7xct+ETEfEi5xaNtfZnXb4cBbztXzlppXAbvsbGcNcnIgOb5x68MSYGjLTWNmX4Xh1QB1BdXe25uDDfhk9zBCISdp5W0RhjDgPuBL6Q6fvW2nprbY21tmbUqFH51BdamiMQkbDzMslaCfwBuNlau8X/kkqD5ghEJOy8jOD/O3AmcKsxJmGMmedzTSUhNUeweLHaMyISTsZaG+gOampqbHNzc6D7EBGJGmPMOmttTT7PEeozWUVExDsFvIhIRCngRUQiSgEvIhJRCngRkYhSwIuIRJQCXkQkohTwIiIRpYAXEYkoBbyISEQp4EVEIkoBLyISUQp4EZGIUsCLiESUAl5EJKIU8CIiEaWAFxGJKAW8iEhEKeBFRCJKAS8iElEKeBGRiFLAi4hElKeAN8b80hjTaIz5tt8FiYiIP3IOeGPMpUC5tTYGnGCMGed/WSIikq8KDz8TB+5Lfv4ocB7wUtcHGGPqgLrkl63GmI1eCyygw4F3il1EFlSnf0qhRlCdfiuVOsfn+wReAn4Y8Gby83eBM7s/wFpbD9QDGGOarbU1nissENXpr1KosxRqBNXpt1KqM9/n8NKD3wMMTX5+sMfnEBGRgHkJ53W4tgzAZCgvnAcAAARCSURBVOB136oRERHfeGnRPASsNsYcA1wETO3n8fUe9lEMqtNfpVBnKdQIqtNvA6ZOY63N/YeMGQnMAVZZa7fnW4SIiPjPU8CLiEj4aYJURCSi8gr4bM5ozfSYQp8J29/+jDHDjTHLjDGPGmMeNMZUGmMqjDFvGGMSyT+TQlBnxpqMMd8zxqw1xvw06BqzrPP6LjWuN8b8R5GO55HGmNV9fH+QMebPxpgnjDFf6G1bCOqsTh6z5caYeuMca4xp6XI8R4Wgzow1FeH3vb86v9elxueNMTcX8nhmypteHpd3dnoO+GzOaM30mEKfCZvl/q4AfmStvQDYDlwInAb83lobT/55NgR19qjJGHMWblXTOcDbxpjZxa7TWvvvqRqB1cDdmWoPuM6RwL248zZ68xVgnbV2GnCZMeaQXrYVu87rgOuttbOAMcAkYArwgy7Hc0cI6uxRUxF+3/ut01p7W5f/nxuBX2eqPcAyM+XNAfzKznxG8HF6ntGazWOy+Tk/9bs/a+3PrLX/mfxyFPA2bnXQxcaYNclXTS8rjnyts5eaZgL3WzeZ8ggwPQR1Am5EBxxprW2m8MezA5gH7O7jMXHSf5dVQE0v24LUb53W2luttc8lv6zCnYU5FbjGGPO0MWZJwDVCdsczU01xCvv7nk2dABhjzgZarLVvUsDj2UvedBfHh+zMJ+C7n9F6ZJaPyebn/JT1/owxMWCktbYJWAvMttaeAwwC/ksI6sxUU2iPJ3AD8O/Jzwt6PK21u6217/fzsKL//8yyTgCMMfOATdbarcAy3C/82UDMGHNacFVmXWemmkJ7PIEbgTuTnxf0eEKPvOnOl/+b+QR8Nme0ZnpM
oc+EzWp/xpjDcP/Yqb7rBmvttuTnzUDQF1XLps5MNYX1eJYB5wOJ5KZCH89shOH/Z1aMMScANwELk5uetNZ+YK3tAP5GOI5npprCejxHAEdYa19Jbiro8cyQN9358n8zn4OdzRmtmR5T6DNh+91fcpLjD8DN1totyc2/McZMNsaUA58Gnil2nb3UFLrjmTQdeMqm1+EW+nhmIwz/P/uV7Cv/HvhCl9HpI8aYo40xBwEX4HrJxZapptAdz6RLgL92+bpgx7OXvOnOn/+b1lpPf4BDcb+kPwKeS+7w9n4eMzzTNq81+Fjn9cAu3GgzgevhTQQ2AM/iJl8CqzGHOnvUhHuRfgL4MfACcHyx60w+bglwaV+1F+IPkEh+nAV8udv3jgM2JY/dWqA807YQ1Pk/gG1d/n/OxL07ej55TL9ciBqzqLNHTYX+fc+mzuT23wFn9lV7gLV1z5vbMvyu+5KdeZ3oZLI4ozXTY7L5OT8Ven9eea3TGDMU+K/A09baV4Oqr8v+SuJ4ZsO4S26cBzxik6PjTNvEuyj9fykkP7JTZ7KKiERUKCY8RETEfwp4EZGIUsCLiESUAl5EJKIU8CIiEfX/AYzI3lBKMJndAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.plot(x_new, y_predict, 'r-', label='Predictions')\n",
    "plt.plot(X, y, 'b.')\n",
    "plt.axis([0, 2, 0, 15])\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([3.99641371]), array([[3.04206717]]))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sklearn's LinearRegression, equivalent to the closed-form solution\n",
    "from sklearn.linear_model import LinearRegression\n",
    "lin_reg = LinearRegression()\n",
    "lin_reg.fit(X, y)\n",
    "lin_reg.intercept_, lin_reg.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "class sklearn.linear_model.LinearRegression(fit_intercept=True, \n",
    "                                            normalize=False, \n",
    "                                            copy_X=True, \n",
    "                                            n_jobs=None)\n",
    "```\n",
    "- Principle: ordinary least squares\n",
    "- Constructor parameters:\n",
    "    - `fit_intercept=True`: whether to fit an intercept term, which accommodates any constant offset in the response; if set to False, the data are expected to be centered already.\n",
    "    - `normalize=False`: normalize each feature by its standard deviation before fitting\n",
    "    - `copy_X=True`: when set to False, X may be overwritten\n",
    "    - `n_jobs=None`: number of processors to use; None means 1, -1 means all processors\n",
    "- Methods:\n",
    "    - `fit(self, X, y, sample_weight=None)`\n",
    "        - X, y are the training samples and their targets\n",
    "        - `sample_weight` gives a weight for each sample\n",
    "    - `predict(self, X)` makes predictions\n",
    "    - `score(self, X, y, sample_weight=None)`\n",
    "        - returns the $R^2$ coefficient of the predictions\n",
    "        - $R^2$ is defined as $(1-u/v)$, where $u$ is the residual sum of squares $\\sum(y_{true}-y_{pred})^2$ and $v$ is the total sum of squares $\\sum(y_{true}-\\bar{y}_{true})^2$\n",
    "        - the best score is 1.0, attained when $y_{true}=y_{pred}$; the score can also be negative\n",
    "\n",
    "- Attributes:\n",
    "    - `coef_`: the coefficient matrix\n",
    "    - `intercept_`: the intercept term"
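,
    "\n",
    "A short usage sketch of the API above (illustrative only; it regenerates the synthetic data from the earlier cells so it runs on its own):\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "rng = np.random.RandomState(42)\n",
    "X = 2 * rng.rand(100, 1)\n",
    "y = 4 + 3 * X + rng.randn(100, 1)\n",
    "\n",
    "lin_reg = LinearRegression().fit(X, y)\n",
    "pred = lin_reg.predict([[1.5]])  # close to 4 + 3 * 1.5 = 8.5\n",
    "r2 = lin_reg.score(X, y)         # R^2 on the training data\n",
    "```"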
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gradient Descent\n",
    "- We minimize a loss function rather than a metric such as accuracy because the loss is a smooth function of the weights, while the metric is not.\n",
    "- Gradient descent computes the gradient of the loss with respect to the parameter vector and steps in the direction opposite to the gradient; when the gradient is 0, a minimum has been reached.\n",
    "\n",
    "![](images/Gradient-Descent.png)\n",
    "- The step size of gradient descent is an important setting, controlled by the learning-rate hyperparameter (`learning_rate`).\n",
    "- `Batch Gradient Descent` uses all of the training data at every step.\n",
    "    - Gradient with respect to a single parameter: $$\\frac{\\partial}{\\partial \\theta_j} MSE(\\theta)=\\frac{2}{m}\\sum_{i=1}^{m}(\\theta^T\\cdot X^{(i)}-y^{(i)})x_{j}^{(i)}$$\n",
    "    - Gradient of the loss with respect to the parameter vector: $$\\nabla_{\\theta} MSE(\\theta)=\n",
    "\\begin{pmatrix}\n",
    "\\frac{\\partial}{\\partial \\theta_0} MSE(\\theta)\\\\\n",
    "\\frac{\\partial}{\\partial \\theta_1} MSE(\\theta)\\\\\n",
    "\\ \\ \\ \\ \\vdots\\\\\n",
    "\\frac{\\partial}{\\partial \\theta_n} MSE(\\theta)\\\\\n",
    "\\end{pmatrix}\n",
    " =\\frac{2}{m}X^T\\cdot{(X\\cdot{\\theta}-y)}$$\n",
    " where $X_{m\\times n}\\qquad \\theta_{n\\times 1}\\qquad y_{m\\times 1}$\n",
    "    - Each update step: $$\\theta^{i+1}=\\theta^i-\\eta \\nabla_\\theta MSE(\\theta)$$\n",
    "    - Set the number of iterations very large, but stop training once the gradient becomes small enough.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eta = 0.1\n",
    "n_iterations = 1000\n",
    "m = 100\n",
    "\n",
    "theta = np.random.randn(2, 1)\n",
    "for iteration in range(n_iterations):\n",
    "    gradients = 2 / m * x_b.T.dot(x_b.dot(theta) - y)\n",
    "    theta = theta - eta * gradients\n",
    "theta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `Stochastic Gradient Descent`\n",
    "    - Stochastic gradient descent picks one random training sample at each step and uses it to compute the gradient.\n",
    "    - As a result the loss bounces up and down instead of decreasing steadily to the minimum.\n",
    "    - When the loss surface is very irregular, with several local minima, stochastic gradient descent can jump out of a local minimum and is more likely to find the global one.\n",
    "    - `Simulated annealing` trick: gradually decrease the learning rate during training; the steps start large, giving fast progress and escapes from local minima, then shrink.\n",
    "    - `Early stopping` trick: stop training once the model's error on a validation set stops decreasing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_epoches = 50\n",
    "t0, t1 = 5, 50\n",
    "\n",
    "\n",
    "def learning_schedule(t):\n",
    "    return (t0 / (t + t1))\n",
    "\n",
    "\n",
    "theta = np.random.randn(2, 1)\n",
    "\n",
    "for epoch in range(n_epoches):\n",
    "    for i in range(m):\n",
    "        random_index = np.random.randint(m)\n",
    "        xi = x_b[random_index:random_index + 1]\n",
    "        yi = y[random_index:random_index + 1]\n",
    "        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)\n",
    "        eta = learning_schedule(epoch * m + i)\n",
    "        theta = theta - eta * gradients\n",
    "\n",
    "theta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([3.11578522]), array([4.00746624]))"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Stochastic-gradient regression model\n",
    "from sklearn.linear_model import SGDRegressor\n",
    "sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)\n",
    "sgd_reg.fit(X, y.ravel())\n",
    "sgd_reg.coef_, sgd_reg.intercept_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Early stopping trick\n",
    "# (assumes prepared training/validation sets X_train_poly_scaled,\n",
    "# y_train, X_val_poly_scaled, y_val)\n",
    "from copy import deepcopy\n",
    "\n",
    "from sklearn.linear_model import SGDRegressor\n",
    "from sklearn.metrics import mean_squared_error\n",
    "sgd_reg = SGDRegressor(max_iter=1,\n",
    "                       tol=None,\n",
    "                       warm_start=True,\n",
    "                       penalty=None,\n",
    "                       learning_rate='constant',\n",
    "                       eta0=0.0005)\n",
    "minimum_val_error = float('inf')\n",
    "best_epoch = None\n",
    "best_model = None\n",
    "for epoch in range(1000):\n",
    "    sgd_reg.fit(X_train_poly_scaled, y_train)  # warm_start: continue from previous epoch\n",
    "    y_val_predict = sgd_reg.predict(X_val_poly_scaled)\n",
    "    val_error = mean_squared_error(y_val, y_val_predict)\n",
    "    if val_error < minimum_val_error:\n",
    "        minimum_val_error = val_error\n",
    "        best_epoch = epoch\n",
    "        best_model = deepcopy(sgd_reg)  # clone() would drop the learned weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `Mini-batch Gradient Descent`\n",
    "    - At each step, compute the gradient on a small, randomly chosen batch of data and update the parameters.\n",
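    "\n",
    "A minimal mini-batch sketch (illustrative; the data and `x_b` from the earlier cells are regenerated here so it runs on its own):\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(42)\n",
    "X = 2 * rng.rand(100, 1)\n",
    "y = 4 + 3 * X + rng.randn(100, 1)\n",
    "x_b = np.c_[np.ones((100, 1)), X]  # prepend the bias column\n",
    "\n",
    "m, batch_size, eta = 100, 20, 0.1\n",
    "theta = rng.randn(2, 1)\n",
    "for epoch in range(50):\n",
    "    shuffled = rng.permutation(m)\n",
    "    for start in range(0, m, batch_size):\n",
    "        idx = shuffled[start:start + batch_size]\n",
    "        xi, yi = x_b[idx], y[idx]\n",
    "        gradients = 2 / batch_size * xi.T.dot(xi.dot(theta) - yi)\n",
    "        theta = theta - eta * gradients\n",
    "theta  # should end up close to [[4], [3]]\n",
    "```\n",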
    "    \n",
    "A comparison of these linear regression algorithms is shown below:\n",
    "![](images/ComparisonLinearRegression.JPG)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating Model Performance\n",
    "- Use cross-validation (`cross-validation`) to estimate a model's generalization performance.\n",
    "- Learning curves (`learning curves`) plot the model's performance on the training and validation sets as a function of the amount of training data."
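,
    "\n",
    "A cross-validation sketch (illustrative; it regenerates the synthetic data from the earlier cells so it runs on its own):\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "rng = np.random.RandomState(42)\n",
    "X = 2 * rng.rand(100, 1)\n",
    "y = (4 + 3 * X + rng.randn(100, 1)).ravel()\n",
    "\n",
    "# sklearn scorers are 'higher is better', hence the negated MSE\n",
    "scores = cross_val_score(LinearRegression(), X, y,\n",
    "                         scoring='neg_mean_squared_error', cv=5)\n",
    "rmse_per_fold = np.sqrt(-scores)  # one RMSE per fold\n",
    "rmse_per_fold.mean(), rmse_per_fold.std()\n",
    "```"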
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "def plot_learning_curves(model, X, y):\n",
    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)\n",
    "    train_errors, val_errors = [], []\n",
    "    for m in range(1, len(X_train)):\n",
    "        model.fit(X_train[:m], y_train[:m])\n",
    "        y_train_predict = model.predict(X_train[:m])\n",
    "        y_val_predict = model.predict(X_test)\n",
    "        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))\n",
    "        val_errors.append(mean_squared_error(y_test, y_val_predict))\n",
    "\n",
    "    plt.plot(np.sqrt(train_errors), 'r-+', linewidth=2, label='train')\n",
    "    plt.plot(np.sqrt(val_errors), 'b-', linewidth=3, label='val')\n",
    "    plt.xlabel('training set size')\n",
    "    plt.ylabel('RMSE')\n",
    "    plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `The Bias-Variance Tradeoff`\n",
    "The gap between a model's predictions and the true values can be decomposed into three kinds of error:\n",
    "- `Bias`: caused by wrong assumptions, e.g. modeling a quadratic relationship as linear; a high-`bias` model tends to **underfit**.\n",
    "- `Variance`: caused by the model being overly sensitive to small variations in the training data; more flexible models, such as high-degree polynomials, tend to have high `variance` and to **overfit** the training data.\n",
    "- `Irreducible error`: caused by noise in the data itself; it can only be reduced by cleaning the data, e.g. detecting outliers or fixing the data source.\n",
    "\n",
    "Increasing a model's complexity raises its `variance` and lowers its `bias`; decreasing complexity raises `bias` and lowers `variance`. This is the tradeoff."
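,
    "\n",
    "A small illustration of the tradeoff (an assumed setup, not from the text above): fitting noisy sine data with polynomials of increasing degree, degree 1 underfits (high bias) while a very high degree overfits (high variance):\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "X = 6 * rng.rand(80, 1) - 3\n",
    "y = np.sin(X).ravel() + 0.3 * rng.randn(80)\n",
    "\n",
    "cv_mse = {}\n",
    "for degree in (1, 4, 15):\n",
    "    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())\n",
    "    scores = cross_val_score(model, X, y,\n",
    "                             scoring='neg_mean_squared_error', cv=5)\n",
    "    cv_mse[degree] = -scores.mean()\n",
    "cv_mse  # the moderate degree should have the lowest CV error\n",
    "```"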
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Underfitting and Overfitting\n",
    "- Overfitting means the model fits the training data too closely: it scores well on the training set but poorly on the test set. Underfitting means the model performs poorly on both.\n",
    "- Overfitting usually occurs when the model is so expressive that it learns idiosyncrasies of the individual training samples, i.e. the model is too complex or the training data too scarce, which hurts generalization.\n",
    "\n",
    "- Ways to reduce the risk of overfitting:\n",
    "    - Get more training data. Data can be augmented by rules, e.g. rotating, translating, or scaling images; going further, a `GAN` can synthesize large amounts of new data.\n",
    "    - Reduce model complexity, e.g. fewer layers or neurons in a neural network; shallower trees or pruning for decision trees.\n",
    "    - Regularization.\n",
    "    - Ensemble methods: combining several models lowers the overfitting risk of any single one.\n",
    "- Ways to reduce the risk of underfitting:\n",
    "    - Add new features. Underfitting often occurs when features are scarce or weakly correlated with the labels.\n",
    "    - Increase model complexity; simple models have limited capacity.\n",
    "    - Decrease the regularization strength."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regularization\n",
    "When training a linear regression model, some coefficients can take very large values, and with many correlated features a tiny change in the data can make the model unstable. Adding regularization controls the weights attached to the variables, heavily penalizing and shrinking the coefficients of unnecessary ones.\n",
    "- Ridge regression (`Ridge Regression`), with loss function\n",
    "$$J(\\theta)=MSE(\\theta)+\\alpha \\frac{1}{2}\\sum_{i=1}^n\\theta_i^2$$\n",
    "    - The bias term $\\theta_0$ is not regularized; the parameter $\\alpha$ controls how much the coefficients shrink: the larger $\\alpha$, the stronger the shrinkage and the closer the coefficients get to 0.\n",
    "    - Closed-form solution of ridge regression, where $A$ is the identity matrix except for a 0 in the top-left corner: $$\\hat{\\theta}=(X^{T}\\cdot{X}+\\alpha A)^{-1}\\cdot{X}^{T}\\cdot{y}$$\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "------------------------------\n",
    "```\n",
    "class sklearn.linear_model.Ridge(alpha=1.0, \n",
    "                                 fit_intercept=True, \n",
    "                                 normalize=False, \n",
    "                                 copy_X=True, \n",
    "                                 max_iter=None, \n",
    "                                 tol=0.001, \n",
    "                                 solver='auto', \n",
    "                                 random_state=None)\n",
    "```\n",
    "- Principle:\n",
    "  - When training a linear regression model, some coefficients can become very large, making the model unstable.\n",
    "  - If the dataset contains many correlated predictors, even tiny changes can destabilize the model; moreover, two negatively correlated variables may pull the response in opposite directions.\n",
    "  - The coefficients can be added to the cost function so that overly large ones are penalized and shrunk; this is a shrinkage method.\n",
    "  - Ridge regression: find the coefficient matrix $W$ minimizing the objective $||y-XW||^2+\\alpha||W||^2$\n",
    "\n",
    "- Parameters:\n",
    "  - `alpha`: controls the amount of shrinkage; the larger $\\alpha$, the stronger the shrinkage and the closer the coefficients get to 0\n",
    "  - `tol`: precision of the solution of the cost function\n",
    "  - `max_iter`, `solver`: the solver used to minimize the cost function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "ridge_reg = Ridge(alpha=1, solver='cholesky')\n",
    "ridge_reg.fit(X, y)\n",
    "ridge_reg.predict([[1.5]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# l2 penalty, i.e. ridge regression\n",
    "sgd_reg = SGDRegressor(penalty='l2')\n",
    "sgd_reg.fit(X, y.ravel())\n",
    "\n",
    "sgd_reg.predict([[1.5]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "# Authors: Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>\n",
    "# License: BSD 3 clause\n",
    "\n",
    "import time\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.svm import SVR\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.model_selection import learning_curve\n",
    "from sklearn.kernel_ridge import KernelRidge\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "\n",
    "# #############################################################################\n",
    "# Generate sample data\n",
    "X = 5 * rng.rand(10000, 1)\n",
    "y = np.sin(X).ravel()\n",
    "\n",
    "# Add noise to targets\n",
    "y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))\n",
    "\n",
    "X_plot = np.linspace(0, 5, 100000)[:, None]\n",
    "\n",
    "# #############################################################################\n",
    "# Fit regression model\n",
    "train_size = 100\n",
    "svr = GridSearchCV(SVR(kernel='rbf', gamma=0.1),\n",
    "                   cv=5,\n",
    "                   param_grid={\n",
    "                       \"C\": [1e0, 1e1, 1e2, 1e3],\n",
    "                       \"gamma\": np.logspace(-2, 2, 5)\n",
    "                   })\n",
    "\n",
    "kr = GridSearchCV(KernelRidge(kernel='rbf', gamma=0.1),\n",
    "                  cv=5,\n",
    "                  param_grid={\n",
    "                      \"alpha\": [1e0, 0.1, 1e-2, 1e-3],\n",
    "                      \"gamma\": np.logspace(-2, 2, 5)\n",
    "                  })\n",
    "\n",
    "t0 = time.time()\n",
    "svr.fit(X[:train_size], y[:train_size])\n",
    "svr_fit = time.time() - t0\n",
    "print(\"SVR complexity and bandwidth selected and model fitted in %.3f s\" %\n",
    "      svr_fit)\n",
    "\n",
    "t0 = time.time()\n",
    "kr.fit(X[:train_size], y[:train_size])\n",
    "kr_fit = time.time() - t0\n",
    "print(\"KRR complexity and bandwidth selected and model fitted in %.3f s\" %\n",
    "      kr_fit)\n",
    "\n",
    "sv_ratio = svr.best_estimator_.support_.shape[0] / train_size\n",
    "print(\"Support vector ratio: %.3f\" % sv_ratio)\n",
    "\n",
    "t0 = time.time()\n",
    "y_svr = svr.predict(X_plot)\n",
    "svr_predict = time.time() - t0\n",
    "print(\"SVR prediction for %d inputs in %.3f s\" %\n",
    "      (X_plot.shape[0], svr_predict))\n",
    "\n",
    "t0 = time.time()\n",
    "y_kr = kr.predict(X_plot)\n",
    "kr_predict = time.time() - t0\n",
    "print(\"KRR prediction for %d inputs in %.3f s\" % (X_plot.shape[0], kr_predict))\n",
    "\n",
    "# #############################################################################\n",
    "# Look at the results\n",
    "sv_ind = svr.best_estimator_.support_\n",
    "plt.scatter(X[sv_ind],\n",
    "            y[sv_ind],\n",
    "            c='r',\n",
    "            s=50,\n",
    "            label='SVR support vectors',\n",
    "            zorder=2,\n",
    "            edgecolors=(0, 0, 0))\n",
    "plt.scatter(X[:100],\n",
    "            y[:100],\n",
    "            c='k',\n",
    "            label='data',\n",
    "            zorder=1,\n",
    "            edgecolors=(0, 0, 0))\n",
    "plt.plot(X_plot,\n",
    "         y_svr,\n",
    "         c='r',\n",
    "         label='SVR (fit: %.3fs, predict: %.3fs)' % (svr_fit, svr_predict))\n",
    "plt.plot(X_plot,\n",
    "         y_kr,\n",
    "         c='g',\n",
    "         label='KRR (fit: %.3fs, predict: %.3fs)' % (kr_fit, kr_predict))\n",
    "plt.xlabel('data')\n",
    "plt.ylabel('target')\n",
    "plt.title('SVR versus Kernel Ridge')\n",
    "plt.legend()\n",
    "\n",
    "# Visualize training and prediction time\n",
    "plt.figure()\n",
    "\n",
    "# Generate sample data\n",
    "X = 5 * rng.rand(10000, 1)\n",
    "y = np.sin(X).ravel()\n",
    "y[::5] += 3 * (0.5 - rng.rand(X.shape[0] // 5))\n",
    "sizes = np.logspace(1, 4, 7).astype(int)\n",
    "for name, estimator in {\n",
    "        \"KRR\": KernelRidge(kernel='rbf', alpha=0.1, gamma=10),\n",
    "        \"SVR\": SVR(kernel='rbf', C=1e1, gamma=10)\n",
    "}.items():\n",
    "    train_time = []\n",
    "    test_time = []\n",
    "    for train_test_size in sizes:\n",
    "        t0 = time.time()\n",
    "        estimator.fit(X[:train_test_size], y[:train_test_size])\n",
    "        train_time.append(time.time() - t0)\n",
    "\n",
    "        t0 = time.time()\n",
    "        estimator.predict(X_plot[:1000])\n",
    "        test_time.append(time.time() - t0)\n",
    "\n",
    "    plt.plot(sizes,\n",
    "             train_time,\n",
    "             'o-',\n",
    "             color=\"r\" if name == \"SVR\" else \"g\",\n",
    "             label=\"%s (train)\" % name)\n",
    "    plt.plot(sizes,\n",
    "             test_time,\n",
    "             'o--',\n",
    "             color=\"r\" if name == \"SVR\" else \"g\",\n",
    "             label=\"%s (test)\" % name)\n",
    "\n",
    "plt.xscale(\"log\")\n",
    "plt.yscale(\"log\")\n",
    "plt.xlabel(\"Train size\")\n",
    "plt.ylabel(\"Time (seconds)\")\n",
    "plt.title('Execution Time')\n",
    "plt.legend(loc=\"best\")\n",
    "\n",
    "# Visualize learning curves\n",
    "plt.figure()\n",
    "\n",
    "svr = SVR(kernel='rbf', C=1e1, gamma=0.1)\n",
    "kr = KernelRidge(kernel='rbf', alpha=0.1, gamma=0.1)\n",
    "train_sizes, train_scores_svr, test_scores_svr = \\\n",
    "    learning_curve(svr, X[:100], y[:100], train_sizes=np.linspace(0.1, 1, 10),\n",
    "                   scoring=\"neg_mean_squared_error\", cv=10)\n",
    "train_sizes_abs, train_scores_kr, test_scores_kr = \\\n",
    "    learning_curve(kr, X[:100], y[:100], train_sizes=np.linspace(0.1, 1, 10),\n",
    "                   scoring=\"neg_mean_squared_error\", cv=10)\n",
    "\n",
    "plt.plot(train_sizes, -test_scores_svr.mean(1), 'o-', color=\"r\", label=\"SVR\")\n",
    "plt.plot(train_sizes, -test_scores_kr.mean(1), 'o-', color=\"g\", label=\"KRR\")\n",
    "plt.xlabel(\"Train size\")\n",
    "plt.ylabel(\"Mean Squared Error\")\n",
    "plt.title('Learning curves')\n",
    "plt.legend(loc=\"best\")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `Lasso Regression(Least Absolute Shrinkage and Selection Operator Regression)`，损失函数：\n",
    " $$J(\\theta)=MSE(\\theta)+\\alpha\\sum_{i=1}^m|\\theta_i|$$ \n",
    "     - `LASSO`回归，倾向于将不重要特征系数变为 0，因此相当于自动进行特征选择，并输出一个稀疏模型\n",
    "     - `LASSO`回归的损失函数在 $\\theta_i=0$ 处不可微，使用 `subgradient vector` $g$ 替代\n",
    "$$\\boxed{g(\\theta,J)=\\bigtriangledown_{\\theta} MSE(\\theta)+\\alpha\n",
    "\\begin{pmatrix} \n",
    "sign(\\theta_1)\\\\sign(\\theta_2)\\\\ \\vdots \\\\sign(\\theta_n)\n",
    "\\end{pmatrix}\\quad \n",
    "sign(\\theta_i)=\n",
    "\\begin{cases}\n",
    "-1 \\quad if \\quad \\theta_i<0\\\\ 0 \\quad\\ if\\quad  \\theta_i=0\\\\ +1 \\quad if \\quad \\theta_i>0\n",
    "\\end{cases}}\n",
    "$$\n",
    "\n",
    "- 原理：\n",
    "  - 与岭回归相比，结果更倾向稀疏，即大多数稀疏都为 0\n",
    "  - 对于相关联的变量。只选择保留其中一个；岭回归会给这些变量的系数分配相同的权重\n",
    "  - 求得系数矩阵 $W$，最小化目标函数 $||(y-X*W)||^2+\\alpha*||W||$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "lasso_reg = Lasso(alpha=0.1)\n",
    "lasso_reg.fit(X, y)\n",
    "lasso_reg.predict([[1.5]])"
   ]
  },
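  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the sparsity claim above, on a hypothetical 5-feature dataset (the names `X_demo`/`y_demo` are ours): only the first two features carry signal, and Lasso zeroes out the rest while ridge merely shrinks them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: Lasso zeroes uninformative coefficients, ridge only shrinks them\n",
    "import numpy as np\n",
    "from sklearn.linear_model import Lasso, Ridge\n",
    "\n",
    "rng_demo = np.random.RandomState(42)\n",
    "X_demo = rng_demo.randn(200, 5)\n",
    "# only features 0 and 1 carry signal; features 2-4 are pure noise\n",
    "y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng_demo.randn(200)\n",
    "\n",
    "lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)\n",
    "ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)\n",
    "print('Lasso:', np.round(lasso_demo.coef_, 3))\n",
    "print('Ridge:', np.round(ridge_demo.coef_, 3))"
   ]
  },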
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- `Elastic Net`，介于岭回归和 `LASSO`回归之间，其损失函数：\n",
    "$$J(\\theta)=MSE(\\theta)+r\\alpha\\sum_{i=1}^m|\\theta_i|+\\frac{1-r}{2}\\alpha\\sum_{i=1}^m\\theta_i^2$$ \n",
    "    - `Elastic Net` 也倾向于将无用特征的权重变为 0 \n",
    "    - 当特征数多于样本数时，或多个特征强相关时，优先使用 `Elastic Net`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import ElasticNet\n",
    "elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)\n",
    "elastic_net.fit(X, y)\n",
    "elastic_net.predict([[1.5]])"
   ]
  },
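  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the correlated-features point (`x1`/`x2` below are a hypothetical pair of nearly identical features): Lasso tends to keep only one of the two, while `Elastic Net` spreads the weight across both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: with two nearly identical features, Elastic Net splits the weight\n",
    "import numpy as np\n",
    "from sklearn.linear_model import ElasticNet, Lasso\n",
    "\n",
    "rng_en = np.random.RandomState(0)\n",
    "x1 = rng_en.randn(200)\n",
    "x2 = x1 + 0.01 * rng_en.randn(200)  # nearly identical to x1\n",
    "X_corr = np.c_[x1, x2]\n",
    "y_corr = x1 + x2 + 0.1 * rng_en.randn(200)\n",
    "\n",
    "lasso_corr = Lasso(alpha=0.1).fit(X_corr, y_corr)\n",
    "en_corr = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_corr, y_corr)\n",
    "print('Lasso      :', lasso_corr.coef_)  # weight concentrated on one feature\n",
    "print('Elastic Net:', en_corr.coef_)     # weight shared between both"
   ]
  },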
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 多项式回归"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "sns.set()\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 2.,  4.,  8.],\n",
       "       [ 3.,  9., 27.],\n",
       "       [ 4., 16., 64.]])"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 多项式特征\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "x = np.array([2, 3, 4])\n",
    "poly = PolynomialFeatures(3, include_bias=False)\n",
    "poly.fit_transform(x[:, None])"
   ]
  },
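  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a sketch with names of ours, `x_q`/`y_q`): fitting a `PolynomialFeatures` + `LinearRegression` pipeline on a noise-free quadratic recovers its coefficients exactly, since polynomial regression is just linear regression on the expanded features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: the pipeline recovers the coefficients of y = 1 + 2x + 3x^2\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "x_q = np.linspace(-3, 3, 50)\n",
    "y_q = 1 + 2 * x_q + 3 * x_q ** 2  # known quadratic, no noise\n",
    "\n",
    "quad_model = make_pipeline(PolynomialFeatures(2, include_bias=False),\n",
    "                           LinearRegression())\n",
    "quad_model.fit(x_q[:, None], y_q)\n",
    "lin_q = quad_model.named_steps['linearregression']\n",
    "print(lin_q.intercept_, lin_q.coef_)"
   ]
  },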
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 多项式回归\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())\n",
    "rng = np.random.RandomState(1)\n",
    "x = 10 * rng.rand(50)\n",
    "y = np.sin(x) + 0.1 * rng.randn(50)\n",
    "\n",
    "poly_model.fit(x[:, np.newaxis], y)\n",
    "yfit = poly_model.predict(xfit[:, np.newaxis])\n",
    "\n",
    "plt.scatter(x, y)\n",
    "plt.plot(x, yfit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 自定义多项式特征\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "\n",
    "\n",
    "class GaussianFeatures(BaseEstimator, TransformerMixin):\n",
    "    def __init__(self, N, width_factor=2.0):\n",
    "        self.N = N\n",
    "        self.width_factor = width_factor\n",
    "\n",
    "    @staticmethod\n",
    "    def _gauss_basis(x, y, width, axis=None):\n",
    "        arg = (x - y) / width\n",
    "        return (np.exp(-0.5 * np.sum(arg**2, axis)))\n",
    "\n",
    "    def fit(self, x, y=None):\n",
    "        self.centers_ = np.linspace(x.min(), x.max(), self.N)\n",
    "        self.width_ = self.width_factor * (self.centers_[1] - self.centers_[0])\n",
    "        return (self)\n",
    "\n",
    "    def transform(self, x):\n",
    "        return (self._gauss_basis(x[:, :, np.newaxis],\n",
    "                                  self.centers_,\n",
    "                                  self.width_,\n",
    "                                  axis=1))\n",
    "\n",
    "\n",
    "gauss_model = make_pipeline(GaussianFeatures(20), LinearRegression())\n",
    "\n",
    "gauss_model.fit(x[:, np.newaxis], y)\n",
    "yfit = gauss_model.predict(xfit[:, np.newaxis])\n",
    "\n",
    "plt.scatter(x, y)\n",
    "plt.plot(xfit, yfit)\n",
    "plt.xlim(0, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def basis_plot(model, title=None):\n",
    "    fig, ax = plt.subplots(2, sharex=True)\n",
    "    model.fit(x[:, np.newaxis], y)\n",
    "    ax[0].scatter(x, y)\n",
    "    ax[0].plot(xfit, model.predict(xfit[:, np.newaxis]))\n",
    "    ax[0].set(xlabel='x', ylabel='y', ylim=(-1.5, 1.5))\n",
    "\n",
    "    if title:\n",
    "        ax[0].set_title(title)\n",
    "\n",
    "    ax[1].plot(model.steps[0][1].centers_, model.steps[1][1].coef_)\n",
    "    ax[1].set(xlabel='basis location', ylabel='coefficient', xlim=(0, 10))\n",
    "\n",
    "\n",
    "model = make_pipeline(GaussianFeatures(30), LinearRegression())\n",
    "basis_plot(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 正则化-岭回归\n",
    "from sklearn.linear_model import Ridge\n",
    "model = make_pipeline(GaussianFeatures(30), Ridge(alpha=0.1))\n",
    "\n",
    "basis_plot(model, title='Ridge Regression')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 正则化-Lasso回归\n",
    "from sklearn.linear_model import Lasso\n",
    "model = make_pipeline(GaussianFeatures(30), Lasso(alpha=0.001))\n",
    "basis_plot(model, title='Lasso Regression')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
